Manuel Müller, University of Konstanz, manuel.mueller@uni-konstanz.de
Ismail Yildiz, University of Konstanz, ismail.yildiz@uni-konstanz.de
Peter Bak, University of Konstanz, peter.bak@uni-konstanz.de
STRAP – Structure based Sequence Alignment Program.
STRAP is a comfortable and comprehensive tool to edit
multiple protein sequence alignments. A wide range of functions related to protein
sequences and protein structures are accessible with an intuitive graphical
interface.
Author: Christoph Gille, Institute for Biochemistry,
Charité Humboldt-University Berlin
Website: http://www.bioinformatics.org/strap/
Dendroscope
Dendroscope is a tool for visualizing phylogenetic
trees and rooted networks.
Author: Daniel H. Huson
Website: http://www-ab.informatik.uni-tuebingen.de/software/dendroscope
Selfmade Java Tool
Our tool includes several functions, such as calculating the distance
between sequences (the number of substitutions from one sequence to
another) and building a phylogenetic tree data structure.
Instructions: When asked for a folder, please select
the folder containing the dataset.
Author: Manuel Müller, Ismail Yildiz
Link: jar
Video:
ANSWERS:
MC3.1: What is the region or country of origin for the current
outbreak? Please provide your answer as the name of the native viral
strain along with a brief explanation.
To find the origin of the current outbreak, we
used the tool STRAP, which calculates pairwise distances and visualizes them in
a spring embedded graph.
Figure 1.1: Spring embedded dissimilarity graph.
In Figure 1.1, the edge lengths
indicate the dissimilarity of the sequences. We can see that Nigeria_B is the
native sequence with the shortest edges to the current outbreak
sequences. This means that Nigeria_B needs the fewest substitutions to become
one of the current outbreak sequences. Nigeria_B
is therefore most likely the origin of the current outbreak.
To corroborate this result, we calculated
the average distance from every native sequence to all current outbreak
sequences and visualized the results in a bar chart (Figure 1.2).
Figure 1.2: Average distance of native sequences to the current outbreak.
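The average-distance comparison can be sketched in Java as follows. This is a minimal illustration, not the code of our tool: the class and method names are hypothetical, the sequences are assumed to be aligned to equal length, and the strain names and bases in `main` are toy data.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SequenceDistance {

    /** Number of positions at which two equal-length sequences differ. */
    static int substitutionDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    /** Average distance from one native sequence to every outbreak sequence. */
    static double averageDistance(String nativeSeq, List<String> outbreak) {
        if (outbreak.isEmpty()) {
            throw new IllegalArgumentException("no outbreak sequences given");
        }
        long total = 0;
        for (String s : outbreak) {
            total += substitutionDistance(nativeSeq, s);
        }
        return (double) total / outbreak.size();
    }

    /** Name of the native strain with the smallest average distance. */
    static String closestNative(Map<String, String> natives, List<String> outbreak) {
        String best = null;
        double bestAverage = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, String> entry : natives.entrySet()) {
            double average = averageDistance(entry.getValue(), outbreak);
            if (average < bestAverage) {
                bestAverage = average;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data only; not taken from the challenge dataset.
        Map<String, String> natives = new LinkedHashMap<>();
        natives.put("strain_X", "ACGTACGT");
        natives.put("strain_Y", "TTTTACGT");
        List<String> outbreak = List.of("ACGTACGA", "ACGAACGT");
        System.out.println(closestNative(natives, outbreak));
    }
}
```

The native strain returned by `closestNative` corresponds to the shortest bar in a chart like Figure 1.2.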
MC3.2: Over time, the virus spreads and the diversity of the virus
increases as it mutates. Two patients infected with the Drafa virus are
in the same hospital as Nicolai. Nicolai has a strain identified by
sequence 583. One patient has a strain identified by sequence 123 and the
other has a strain identified by sequence 51. Assume only a single viral
strain is in each patient. Which patient likely contracted the illness
from Nicolai and why? Please provide your answer as the sequence number
along with a brief explanation.
To identify the person who likely contracted
the illness from Nicolai, we used the tool STRAP to generate a spring embedded
graph for the three sequences. The edge lengths in the graph indicate the
dissimilarity (number of substitutions) between the sequences.
Figure 2.1: Spring embedded graph.
This visualization shows that sequence 123 is more similar to sequence 583
than sequence 51 is.
Further, we
used our own tool to identify the substitutions involved and obtained the following
results:
583 → 123: (A->C@269)
583 → 51: (A->C@494, C->T@842, T->A@946)
Based on this result,
it is more likely that the patient with sequence 123 contracted
the illness from Nicolai than the patient with sequence 51, because sequence
123 is closer to sequence 583 (distance = 1) than sequence 51 is (distance = 3).
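Listing the substitutions between two sequences can be sketched as below. This is only an illustration with hypothetical names, not our tool's actual code; it assumes aligned sequences of equal length and 1-based positions counted from left to right, and it prints the same `X->Y@position` notation used above.

```java
import java.util.ArrayList;
import java.util.List;

class SubstitutionLister {

    /**
     * Returns every position where 'to' differs from 'from',
     * formatted as "X->Y@position" with 1-based positions (assumption).
     */
    static List<String> substitutions(String from, String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < from.length(); i++) {
            char before = from.charAt(i);
            char after = to.charAt(i);
            if (before != after) {
                result.add(before + "->" + after + "@" + (i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy sequences only; not taken from the challenge dataset.
        System.out.println(substitutions("ACGT", "CCGA"));
    }
}
```

The size of the returned list is the distance between the two sequences.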
To check
our result, we looked at the evolutionary path of the sequences,
which we visualized with a phylogenetic tree.
Figure 2.2: Phylogenetic tree (edge length represents the number of substitutions).
In Figure 2.2
we can see that >123 follows the same path as Nicolai's sequence plus one additional
substitution, while >51 takes a different path.
MC3.3: Signs and symptoms of the Drafa virus are varied and humans
react differently to infection. Some mutant strains from the current
outbreak have been reported as being worse than others for the patients that
come in contact with them.
Identify the top 3 mutations that lead to an increase in symptom
severity (a disease characteristic). The mutations involve one or more
base substitutions. For this question, the biological properties of the
underlying amino acid sequence patterns are not significant in determining
disease characteristics.
For each mutation provide the base substitutions and their position in
the sequence (left to right) where the base substitutions occurred. For
example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513
and T changed to A at position 907)
A → G, 39 (A changed to G at position 39)
To identify the top 3 mutations
that lead to an increase in symptom severity, we first assigned each sequence a
value for its symptom severity (mild=0, moderate=1, severe=2). After that, we
used our Java Tool to build a phylogenetic tree data structure from the
sequences. With this tree we were able to determine which sequences include a
specific substitution (which is represented as an edge) and which do not.
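The two preparation steps can be sketched as follows. This is a simplified illustration with hypothetical names: instead of walking the phylogenetic tree as our tool does, it assumes that a sequence includes a substitution whenever it carries the substituted base at the given 1-based position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SubstitutionCarriers {

    /** Ordinal-to-numeric mapping: mild=0, moderate=1, severe=2. */
    static final Map<String, Integer> SEVERITY =
            Map.of("mild", 0, "moderate", 1, "severe", 2);

    static int severityValue(String label) {
        Integer value = SEVERITY.get(label.trim().toLowerCase(Locale.ROOT));
        if (value == null) {
            throw new IllegalArgumentException("unknown severity: " + label);
        }
        return value;
    }

    /**
     * Simplifying assumption (not the tree-based test): a sequence includes
     * the substitution if it has base 'to' at the 1-based position.
     */
    static boolean carries(String sequence, int position, char to) {
        return sequence.charAt(position - 1) == to;
    }

    /** IDs of all sequences that include the substitution, in map order. */
    static List<String> carriers(Map<String, String> sequencesById, int position, char to) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> entry : sequencesById.entrySet()) {
            if (carries(entry.getValue(), position, to)) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }
}
```

A tree-based membership test additionally distinguishes sequences that acquired the same base along different paths, which this flat check cannot do.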
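Then we calculated a weighted average increase value for every substitution,
following the same scheme that is described in detail under MC3.4: the average
symptom value of the sequences including the substitution (Including_symp_sum
divided by the number of these sequences) minus the average symptom value of the
sequences excluding it, weighted by the number of occurrences of the substitution.
Including_symp_sum: sum of symptom values of
sequences including the substitution.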
To get a visual overview of these results, we
generated a phylogenetic tree with STRAP, exported it, and restyled it in
Dendroscope (Figure 3.1). Nodes represent sequences, edges represent
substitutions, color is mapped to symptom severity, and every substitution is
labeled with its weighted average increase value in square brackets.
Figure 3.1: Phylogenetic tree visualization of the current outbreak in Dendroscope.
In Figure 3.1 we marked in red the top 3 substitutions that lead to an increase
in symptom severity:
A → T, 946 (A changed to T at position 946)
T → C, 842 (T changed to C at position 842)
G → A, 223 (G changed to A at position 223)
Figure 3.2: Bar chart of weighted average increase for all substitutions.
MC3.4: Due to the rapid spread of the virus and limited resources,
medical personnel would like to focus on treatments and quarantine procedures
for the worst of the mutant strains from the current outbreak, not just
symptoms as in the previous question. To find the most dangerous viral
mutants, experts are monitoring multiple disease characteristics.
Consider each virulence and drug resistance characteristic as equally
important. Identify the top 3 mutations that lead to the most dangerous
viral strains. The mutations involve one or more base substitutions. In a
worst case scenario, a very dangerous strain could cause severe symptoms, have
high mortality, cause major complications, exhibit resistance to anti viral
drugs, and target high risk groups. For this question, the biological
properties of the underlying amino acid sequence patterns are not significant
in determining disease characteristics.
For each mutation provide the base substitutions and their position in
the sequence (left to right) where the base substitutions occurred. For
example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513
and T changed to A at position 907)
A → G, 39 (A changed to G at position 39).
To identify the top 3 mutations
that lead to the most dangerous viral strains, we first transformed the ordinal
values from the characteristic table into numeric values (e.g. mild=0, moderate=1,
severe=2). After that, we used our Java Tool to build a phylogenetic tree data
structure from the sequences. With this tree we were able to determine which sequences
include a specific substitution (which is represented as an edge) and which do not.
Then, for each substitution and each characteristic, we calculated one
characteristic sum over all sequences including the
substitution and one characteristic sum over all sequences excluding the
substitution. The including and excluding sequences can easily be distinguished
with the phylogenetic tree structure. After that, we obtained for each
characteristic an average value per group by dividing each sum by the number of
sequences in that group. The difference between these two averages indicates either an increase
(positive value) or a decrease (negative value) in the severity of the specific
characteristic. A positive difference means that the
subtree of sequences including the substitution has a higher average
characteristic value than the sequences not including the substitution.
Finally, we weighted the average increase by the number of occurrences of the
substitution. With these values we can distinguish substitutions that cause an
increase in the specific characteristic from those that cause a decrease. We summed
the weighted average increase values over all characteristics to get a value that
indicates the overall increase of the characteristics.
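The scoring described above can be sketched in Java as follows. This is a hedged illustration rather than our tool's code: the names are hypothetical, each sequence is represented by an array of its numeric characteristic values, and the "number of occurrences" of a substitution is assumed to be the number of sequences including it and is applied as a multiplicative weight.

```java
import java.util.List;

class OverallIncrease {

    /** Mean of characteristic column c over the given sequences. */
    static double mean(List<int[]> rows, int c) {
        double sum = 0;
        for (int[] row : rows) {
            sum += row[c];
        }
        return sum / rows.size();
    }

    /**
     * Sums, over all characteristics, the weighted average increase:
     * (mean of including - mean of excluding) * weight.
     * Assumption: weight = number of sequences including the substitution.
     */
    static double overallIncrease(List<int[]> including, List<int[]> excluding) {
        if (including.isEmpty() || excluding.isEmpty()) {
            throw new IllegalArgumentException("both groups must be non-empty");
        }
        int characteristics = including.get(0).length;
        int weight = including.size();
        double total = 0;
        for (int c = 0; c < characteristics; c++) {
            double increase = mean(including, c) - mean(excluding, c);
            total += increase * weight;
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy values for two characteristics; not taken from the challenge data.
        List<int[]> including = List.of(new int[] {2, 1}, new int[] {2, 2});
        List<int[]> excluding = List.of(new int[] {0, 1}, new int[] {1, 1}, new int[] {2, 1});
        System.out.println(overallIncrease(including, excluding));
    }
}
```

Ranking all substitutions by this value and taking the three largest gives the kind of top-3 list reported below. With a single characteristic, the same computation reduces to the symptom-severity score of MC3.3.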
To get a visual overview of these results, we imported
the current outbreak sequences into STRAP and generated a phylogenetic tree
(Figure 4.1).
Figure 4.1: Phylogenetic tree visualization in STRAP.
Because STRAP offers few customization settings,
we decided to export the tree and import it into Dendroscope, which
offers many more customization options.
Figure 4.2: Phylogenetic tree visualization of the current outbreak in Dendroscope.
In Figure 4.2 the nodes represent
sequences, the color of the nodes is mapped to the characteristic sum (sum of
the numeric values over all characteristics), and edges represent substitutions.
Every substitution is labeled with its weighted average overall increase value in
square brackets.
With this visualization we can easily find the
worst substitutions, those that lead to an overall increase in characteristics, by
picking out the ones with the highest weighted average overall increase value (the top
three are marked in red):
A → T, 946 (A changed to T at position 946)
T → C, 842 (T changed to C at position 842)
T → C, 790 (T changed to C at position 790)
A complete list of the
substitutions and their overall characteristic increase values is provided in
Figure 4.3.
Figure 4.3: Bar chart of weighted average overall increase for all substitutions.
The estimated time for processing the question was about 30-40 hours.